Introduction

This study aims to explore, clean, and prepare a dataset to effectively find correlations among the variables and classify individuals into three categories: Diabetic, Non-Diabetic, and Prediabetic. The motivation behind this project is to gain experience in data analysis and machine learning techniques while also providing a reference for others interested in similar topics.

The data we have chosen for this project was extracted from https://hbiostat.org/data/, provided by Dr. John Schorling from the Department of Medicine, School of Medicine at the University of Virginia. This data comes from a study aimed at understanding the prevalence of obesity and diabetes among African Americans, as well as other cardiovascular risk factors. More information about the data collection process can be found in the article: Willems JP, Saunders JT, Hunt DE, Schorling JB: “Prevalence of coronary heart disease risk factors among rural blacks: A community-based study.” Southern Medical Journal 90:814-820; 1997.

This project is divided into four main parts:

  1. Raw Data Exploration and Cleaning: this section covers the typical first approach to the data. We will inspect the dimensions of the dataframe, explore which variables are included and what values they take, and, where necessary, reformat the data, eliminate variables that do not provide useful information for the analysis, and treat NA values as appropriate.

  2. Correlation Exploration: in this section we will look for relationships among the variables in our dataset. This requires choosing an appropriate statistical test.

  3. Principal Component Analysis: a principal component analysis (PCA) will be performed to check whether the dimensionality of our data can be reduced without losing too much information.

  4. SVM, k-NN and Decision Tree: lastly, we will train three machine learning models on the dimensionality-reduced data and test how efficient they are in predicting the response variable.

Raw Data Exploration and Cleaning

Initial exploration

We will first load our data into our workspace.

library(readr)
diabetes <- read_csv("/Users/inesmora/Downloads/diabetes.csv")
## Rows: 403 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): location, gender, frame
## dbl (16): id, chol, stab.glu, hdl, ratio, glyhb, age, height, weight, bp.1s,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Once the dataframe is loaded, we can first use the function head() to see the first rows and get an idea of how the data is structured.

head(diabetes)
## # A tibble: 6 × 19
##      id  chol stab.glu   hdl ratio glyhb location     age gender height weight
##   <dbl> <dbl>    <dbl> <dbl> <dbl> <dbl> <chr>      <dbl> <chr>   <dbl>  <dbl>
## 1  1000   203       82    56  3.60  4.31 Buckingham    46 female     62    121
## 2  1001   165       97    24  6.90  4.44 Buckingham    29 female     64    218
## 3  1002   228       92    37  6.20  4.64 Buckingham    58 female     61    256
## 4  1003    78       93    12  6.5   4.63 Buckingham    67 male       67    119
## 5  1005   249       90    28  8.90  7.72 Buckingham    64 male       68    183
## 6  1008   248       94    69  3.60  4.81 Buckingham    34 male       71    190
## # ℹ 8 more variables: frame <chr>, bp.1s <dbl>, bp.1d <dbl>, bp.2s <dbl>,
## #   bp.2d <dbl>, waist <dbl>, hip <dbl>, time.ppn <dbl>

Our data contains variables that provide us with information related to individuals’ health and physical characteristics, such as cholesterol levels, glycated hemoglobin levels, weight, height, etc. Let’s check the dimensions of our dataframe:

dim(diabetes)
## [1] 403  19

Our data frame contains 403 rows (samples or individuals) and 19 columns (variables). Let’s explore whether NA or NULL values are present in our data.

table(is.na(diabetes))
## 
## FALSE  TRUE 
##  7082   575
table(is.null(diabetes))
## 
## FALSE 
##     1

We have 575 NA values. Note that is.null() tests whether the whole object is NULL rather than checking entries element-wise, so the single FALSE simply tells us that the dataframe itself is not NULL.
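A total count does not tell us where the missing values are. A per-column tally is more informative; a minimal sketch on a hypothetical mini data frame mimicking the structure of `diabetes` (on the real data the call would simply be `colSums(is.na(diabetes))`):

```r
# Toy data frame standing in for `diabetes` (values are illustrative only)
mock <- data.frame(
  chol  = c(203, NA, 228),
  bp.2s = c(NA, NA, 185),
  bp.2d = c(NA, NA, 92)
)

# Count NA values per column
na_counts <- colSums(is.na(mock))
print(na_counts)
```

On the real data this immediately reveals which variables concentrate the missing values.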

We can also use the function str() to quickly explore the type of data contained in each variable:

str(diabetes)
## spc_tbl_ [403 × 19] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id      : num [1:403] 1000 1001 1002 1003 1005 ...
##  $ chol    : num [1:403] 203 165 228 78 249 248 195 227 177 263 ...
##  $ stab.glu: num [1:403] 82 97 92 93 90 94 92 75 87 89 ...
##  $ hdl     : num [1:403] 56 24 37 12 28 69 41 44 49 40 ...
##  $ ratio   : num [1:403] 3.6 6.9 6.2 6.5 8.9 ...
##  $ glyhb   : num [1:403] 4.31 4.44 4.64 4.63 7.72 ...
##  $ location: chr [1:403] "Buckingham" "Buckingham" "Buckingham" "Buckingham" ...
##  $ age     : num [1:403] 46 29 58 67 64 34 30 37 45 55 ...
##  $ gender  : chr [1:403] "female" "female" "female" "male" ...
##  $ height  : num [1:403] 62 64 61 67 68 71 69 59 69 63 ...
##  $ weight  : num [1:403] 121 218 256 119 183 190 191 170 166 202 ...
##  $ frame   : chr [1:403] "medium" "large" "large" "large" ...
##  $ bp.1s   : num [1:403] 118 112 190 110 138 132 161 NA 160 108 ...
##  $ bp.1d   : num [1:403] 59 68 92 50 80 86 112 NA 80 72 ...
##  $ bp.2s   : num [1:403] NA NA 185 NA NA NA 161 NA 128 NA ...
##  $ bp.2d   : num [1:403] NA NA 92 NA NA NA 112 NA 86 NA ...
##  $ waist   : num [1:403] 29 46 49 33 44 36 46 34 34 45 ...
##  $ hip     : num [1:403] 38 48 57 38 41 42 49 39 40 50 ...
##  $ time.ppn: num [1:403] 720 360 180 480 300 195 720 1020 300 240 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   chol = col_double(),
##   ..   stab.glu = col_double(),
##   ..   hdl = col_double(),
##   ..   ratio = col_double(),
##   ..   glyhb = col_double(),
##   ..   location = col_character(),
##   ..   age = col_double(),
##   ..   gender = col_character(),
##   ..   height = col_double(),
##   ..   weight = col_double(),
##   ..   frame = col_character(),
##   ..   bp.1s = col_double(),
##   ..   bp.1d = col_double(),
##   ..   bp.2s = col_double(),
##   ..   bp.2d = col_double(),
##   ..   waist = col_double(),
##   ..   hip = col_double(),
##   ..   time.ppn = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

We have 3 variables of type chr; the rest are numeric. Even within the first samples, we can see a large concentration of NA values in the variables bp.2s and bp.2d.

With the function summary() we can see a statistical summary of our data, including the quartiles, the mean, and the number of NA’s:

summary(diabetes)
##        id             chol          stab.glu          hdl        
##  Min.   : 1000   Min.   : 78.0   Min.   : 48.0   Min.   : 12.00  
##  1st Qu.: 4792   1st Qu.:179.0   1st Qu.: 81.0   1st Qu.: 38.00  
##  Median :15766   Median :204.0   Median : 89.0   Median : 46.00  
##  Mean   :15978   Mean   :207.8   Mean   :106.7   Mean   : 50.45  
##  3rd Qu.:20336   3rd Qu.:230.0   3rd Qu.:106.0   3rd Qu.: 59.00  
##  Max.   :41756   Max.   :443.0   Max.   :385.0   Max.   :120.00  
##                  NA's   :1                       NA's   :1       
##      ratio            glyhb         location              age       
##  Min.   : 1.500   Min.   : 2.68   Length:403         Min.   :19.00  
##  1st Qu.: 3.200   1st Qu.: 4.38   Class :character   1st Qu.:34.00  
##  Median : 4.200   Median : 4.84   Mode  :character   Median :45.00  
##  Mean   : 4.522   Mean   : 5.59                      Mean   :46.85  
##  3rd Qu.: 5.400   3rd Qu.: 5.60                      3rd Qu.:60.00  
##  Max.   :19.300   Max.   :16.11                      Max.   :92.00  
##  NA's   :1        NA's   :13                                        
##     gender              height          weight         frame          
##  Length:403         Min.   :52.00   Min.   : 99.0   Length:403        
##  Class :character   1st Qu.:63.00   1st Qu.:151.0   Class :character  
##  Mode  :character   Median :66.00   Median :172.5   Mode  :character  
##                     Mean   :66.02   Mean   :177.6                     
##                     3rd Qu.:69.00   3rd Qu.:200.0                     
##                     Max.   :76.00   Max.   :325.0                     
##                     NA's   :5       NA's   :1                         
##      bp.1s           bp.1d            bp.2s           bp.2d       
##  Min.   : 90.0   Min.   : 48.00   Min.   :110.0   Min.   : 60.00  
##  1st Qu.:121.2   1st Qu.: 75.00   1st Qu.:138.0   1st Qu.: 84.00  
##  Median :136.0   Median : 82.00   Median :149.0   Median : 92.00  
##  Mean   :136.9   Mean   : 83.32   Mean   :152.4   Mean   : 92.52  
##  3rd Qu.:146.8   3rd Qu.: 90.00   3rd Qu.:161.0   3rd Qu.:100.00  
##  Max.   :250.0   Max.   :124.00   Max.   :238.0   Max.   :124.00  
##  NA's   :5       NA's   :5        NA's   :262     NA's   :262     
##      waist           hip           time.ppn     
##  Min.   :26.0   Min.   :30.00   Min.   :   5.0  
##  1st Qu.:33.0   1st Qu.:39.00   1st Qu.:  90.0  
##  Median :37.0   Median :42.00   Median : 240.0  
##  Mean   :37.9   Mean   :43.04   Mean   : 341.2  
##  3rd Qu.:41.0   3rd Qu.:46.00   3rd Qu.: 517.5  
##  Max.   :56.0   Max.   :64.00   Max.   :1560.0  
##  NA's   :2      NA's   :2       NA's   :3

The variables bp.2s and bp.2d contain most of the NA values present in our data (262 each, more than half of the samples). Since so many NAs are concentrated in just two variables, removing every observation containing an NA would discard valuable information held in the other variables: the number of samples would drop from 403 to 130.

diabetes_noNA <- na.omit(diabetes)
dim(diabetes_noNA)
## [1] 130  19

If these variables were not relevant to our study, we could simply eliminate them and then omit the NA values. In this case, however, the authors of the data provide an explanation. The article by Willems JP, Saunders JT, Hunt DE, Schorling JB: “Prevalence of coronary heart disease risk factors among rural blacks: A community-based study.” Southern Medical Journal 90:814-820; 1997 explains that blood pressure was measured approximately 15 minutes after the examination began (bp.1), and if the pressure was greater than or equal to 140/90 mmHg, a second measurement was taken approximately 10 minutes later (bp.2). For these cases, the blood pressure considered by the researchers is the average of the two measurements.

Now that we understand why there are so many NA values in these fields, we can reformat our table accordingly and then omit the remaining NA values.

Data Cleaning and Reformatting

Firstly, we will convert the data to the metric system (optional) and create a couple of new variables:

diabetes$weight <- diabetes$weight * 0.453592 # From pounds to Kg
diabetes$height <- diabetes$height * 2.54 # From inches to cm, consistent with waist and hip below
diabetes$BMI <- with(diabetes, weight / ((height)/100)^2) # We can create a variable with the BMI for each sample

diabetes$waist <- diabetes$waist * 2.54 # From inches to cm
diabetes$hip <- diabetes$hip * 2.54 # From inches to cm
diabetes$waist_to_hip <- diabetes$waist / diabetes$hip # We can create a variable for hip to waist ratio

Next, we create variables containing the means of the systolic and diastolic blood pressures; if one of the two measurements is missing, we keep the available one. We will then delete the variables that are of no further use for our analysis.

diabetes$bp.s <- ifelse(!is.na(diabetes$bp.1s) & !is.na(diabetes$bp.2s),
                        (diabetes$bp.1s + diabetes$bp.2s) / 2,
                        ifelse(!is.na(diabetes$bp.1s), diabetes$bp.1s,
                               diabetes$bp.2s))

diabetes$bp.d <- ifelse(!is.na(diabetes$bp.1d) & !is.na(diabetes$bp.2d),
                        (diabetes$bp.1d + diabetes$bp.2d) / 2,
                        ifelse(!is.na(diabetes$bp.1d), diabetes$bp.1d,
                               diabetes$bp.2d))

columns_to_delete <- c("bp.1s", "bp.2s", "bp.1d", "bp.2d", "time.ppn", "id", "gender", "frame", "location")

# We can now eliminate the columns "bp.1s", "bp.2s", "bp.1d" and "bp.2d", since we have incorporated the variables "bp.s" and "bp.d". We also remove other variables that we won't be using in our analysis
diabetes <- diabetes[, !(names(diabetes) %in% columns_to_delete)]

Finally, we omit the NA values:

diabetes <- na.omit(diabetes)
dim(diabetes)
## [1] 377  14

Now that our data is clean, we can examine it more thoroughly.

Correlation Exploration

Normal distribution?

To begin our analysis, we need to understand how the values of our variables are distributed. This will allow us to make informed decisions on the statistical testing required to find correlations. To assess the distribution of our variables, we will perform the following steps:

  1. Histogram Visualization:
    Histograms provide a graphical representation of the distribution of each variable. They allow us to quickly assess whether the data is approximately normal, skewed, or has heavy tails. This visual examination is a preliminary step to identify patterns or irregularities in the data.

  2. Q-Q Plots (Quantile-Quantile Plots):
    Q-Q plots compare the quantiles of our variables against the quantiles of a standard normal distribution. If the data is normally distributed, the points will approximately lie on a straight diagonal line. Deviations from this line indicate departures from normality, such as skewness or heavy tails.

  3. Shapiro-Wilk Test:
    This is a statistical test used to determine if a sample comes from a normally distributed population. It is particularly effective for small to moderate-sized datasets. However, it tends to be overly sensitive for large datasets, often detecting minor deviations from normality that are not practically significant.

  4. Anderson-Darling Test:
    The Anderson-Darling test is an alternative to the Shapiro-Wilk test, especially when we are interested in assessing the fit of the data to a normal distribution with particular sensitivity to deviations in the tails. Unlike the Shapiro-Wilk test, the Anderson-Darling test performs well on both small and large datasets.

Why Multiple Tests?

Using both visualizations (histograms, Q-Q plots) and statistical tests (Shapiro-Wilk, Anderson-Darling) provides a comprehensive assessment of normality. Visual methods offer an intuitive understanding of the data distribution, while statistical tests provide formal evidence to support or reject normality assumptions.

library(nortest)

check_normality <- function(data) {
  # Prepare an empty data frame to store results
  results <- data.frame(Variable = character(), Test = character(), Statistic = numeric(), p_value = numeric())
  
  for (col_name in colnames(data)) {
    if (is.numeric(data[[col_name]])) {
      
      # Shapiro-Wilk Test
      shapiro <- shapiro.test(data[[col_name]])
      results <- rbind(results, data.frame(Variable = col_name, 
                                           Test = "Shapiro-Wilk", 
                                           Statistic = shapiro$statistic, 
                                           p_value = shapiro$p.value))
      
      # Anderson-Darling Test
      anderson_darling <- ad.test(data[[col_name]])
      results <- rbind(results, data.frame(Variable = col_name, 
                                           Test = "Anderson-Darling", 
                                           Statistic = anderson_darling$statistic, 
                                           p_value = anderson_darling$p.value))
      
      # Plot Histogram and QQ plot for each variable
      par(mfrow = c(1, 2))
      hist(data[[col_name]], main = paste("Histogram of", col_name), col = "skyblue", breaks = 20)
      qqnorm(data[[col_name]], main = paste("QQ Plot of", col_name))
      qqline(data[[col_name]], col = "red")
    }
  }
  
  # Reset plot layout to single plot
  par(mfrow = c(1, 1))
  
  # Return the results data frame
  return(results)
}


normality_results <- check_normality(diabetes)

We have visually explored the distribution of our data using histograms and Q-Q plots. This initial inspection provides a rough idea of whether our variables follow a normal distribution. For some variables, such as bp.d and waist_to_hip, the Q-Q plots and histograms suggest a reasonable approximation to normality, but the presence of longer tails warrants closer examination.

On the other hand, variables like glyhb and stab.glu clearly deviate from normality, displaying skewed distributions and notable departures from the Q-Q line.

However, visual inspection alone is not sufficient to conclusively determine normality. Therefore, it is essential to apply formal statistical tests that provide quantitative evidence of normality or lack thereof.

print(normality_results)
##         Variable             Test  Statistic      p_value
## W           chol     Shapiro-Wilk  0.9563425 3.859776e-09
## A           chol Anderson-Darling  2.7910413 4.909949e-07
## W1      stab.glu     Shapiro-Wilk  0.6529111 5.429144e-27
## A1      stab.glu Anderson-Darling 45.4074719 3.700000e-24
## W2           hdl     Shapiro-Wilk  0.9201966 2.885412e-13
## A2           hdl Anderson-Darling  7.7388169 6.630838e-19
## W3         ratio     Shapiro-Wilk  0.8648210 1.378889e-17
## A3         ratio Anderson-Darling  6.1704161 3.476246e-15
## W4         glyhb     Shapiro-Wilk  0.7226508 1.468786e-24
## A4         glyhb Anderson-Darling 37.0011052 3.700000e-24
## W5           age     Shapiro-Wilk  0.9733102 2.044750e-06
## A5           age Anderson-Darling  2.5055371 2.444036e-06
## W6        height     Shapiro-Wilk  0.9877327 2.870518e-03
## A6        height Anderson-Darling  1.8212077 1.159347e-04
## W7        weight     Shapiro-Wilk  0.9647110 6.839620e-08
## A7        weight Anderson-Darling  3.0563492 1.107998e-07
## W8         waist     Shapiro-Wilk  0.9783109 1.940414e-05
## A8         waist Anderson-Darling  2.1253135 2.081800e-05
## W9           hip     Shapiro-Wilk  0.9581573 6.986266e-09
## A9           hip Anderson-Darling  4.4449305 4.778402e-11
## W10          BMI     Shapiro-Wilk  0.9633143 4.124121e-08
## A10          BMI Anderson-Darling  3.2871773 3.040569e-08
## W11 waist_to_hip     Shapiro-Wilk  0.9888348 5.561062e-03
## A11 waist_to_hip Anderson-Darling  0.6480560 9.020888e-02
## W12         bp.s     Shapiro-Wilk  0.9371397 1.573833e-11
## A12         bp.s Anderson-Darling  4.3094905 1.014257e-10
## W13         bp.d     Shapiro-Wilk  0.9945953 2.069029e-01
## A13         bp.d Anderson-Darling  0.6509197 8.874930e-02

These results show the normality testing for different variables using both the Shapiro-Wilk (W) and Anderson-Darling (A) tests.

NOTE

  • p-values < 0.05: The variable significantly deviates from normality.
  • p-values > 0.05: The variable does not significantly deviate from normality, i.e., it could be considered approximately normal.
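Applying this rule programmatically keeps the assessment reproducible. A minimal sketch on a toy results table with the same columns as `normality_results` (on the real results the call would be `subset(normality_results, p_value > 0.05)`; the values below are illustrative):

```r
# Toy stand-in for the normality_results data frame built earlier
results <- data.frame(
  Variable = c("bp.d", "glyhb", "waist_to_hip"),
  Test     = c("Shapiro-Wilk", "Shapiro-Wilk", "Anderson-Darling"),
  p_value  = c(0.207, 1.5e-24, 0.090)
)

# Keep only the tests that do not reject normality at the 5% level
approx_normal <- subset(results, p_value > 0.05)
print(approx_normal$Variable)
```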

Variables That Passed Normality Tests (p > 0.05):

  • bp.d: Both tests (p = 0.20 and p = 0.0887) suggest this variable is normally distributed.
  • waist_to_hip: The Shapiro-Wilk test shows a small deviation (p = 0.0056), but the Anderson-Darling test (p = 0.090) suggests that the departure from normality is not strong. This variable only slightly deviates from normality.

Variables That Failed Normality Tests (p < 0.05):

  • All other variables.
  • Very low p-values indicate strong deviation from normality.
  • The Anderson-Darling test is particularly sensitive to deviations in the tails, and many variables show significant issues here.

Correlation Analysis

Now that we have ruled out normality in most of our data, we have valuable information on how to proceed to look for correlations within our variables. Since parametric correlation methods like Pearson’s correlation assume normality and linearity, applying them to non-normally distributed data could produce misleading results.

Instead, we will rely on non-parametric correlation methods, which are more robust to violations of normality and can detect monotonic relationships even if they are not perfectly linear. The two most commonly used methods are:

  • Spearman’s Rank Correlation Coefficient: Measures the strength and direction of the monotonic relationship between two variables. It converts data to ranks, making it appropriate for data that is not normally distributed or contains outliers.
  • Kendall’s Tau: A more conservative non-parametric method that evaluates the ordinal association between two variables. It is particularly useful for smaller datasets or when the data contains many tied ranks.
cor_matrices <- list(
  Spearman = cor(diabetes, use = "pairwise.complete.obs", method = "spearman"),
  Kendall = cor(diabetes, use = "pairwise.complete.obs", method = "kendall")
)

To observe relationships among the variables more comfortably, we will create a heatmap from each correlation matrix.

library(ggplot2)
library(reshape2)

plot_heatmap <- function(cor_matrix, title) {
  melted_cor <- melt(cor_matrix)
  ggplot(data = melted_cor, aes(x = Var1, y = Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(title = title, fill = "Correlation")
}

plot_heatmap(cor_matrices$Spearman, "Spearman Correlation Matrix")

plot_heatmap(cor_matrices$Kendall, "Kendall Correlation Matrix")

First, we observe that Kendall’s Tau reports weaker correlations than Spearman does; this is expected, since tau values are typically smaller in magnitude than Spearman’s rho for the same data.

We can observe some variables that are strongly correlated with each other, either positively or negatively. Some correlations are not surprising, such as hip with waist, or weight with waist and hip. We do not see very high correlations between glyhb and the other variables, except for stabilized glucose levels (stab.glu). This makes sense because glycated hemoglobin is an indicator of long-term glycemic control, reflecting the average blood glucose exposure over a prolonged period, approximately the last 2 to 3 months. The second variable with a moderately positive correlation is the cholesterol-to-HDL ratio (ratio). Other variables are positively related to a lesser degree.
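To read off the strongest associations with glyhb without scanning the heatmap, we can rank one row of the correlation matrix by absolute value. A minimal sketch: on the real data the row would come from `cor_matrices$Spearman["glyhb", ]`; here we illustrate on a small synthetic data frame so the snippet is self-contained.

```r
# Synthetic stand-in: stab.glu is built to track glyhb, height is independent
set.seed(1)
x <- rnorm(100)
toy <- data.frame(
  glyhb    = x,
  stab.glu = x + rnorm(100, sd = 0.3),
  height   = rnorm(100)
)

cm <- cor(toy, method = "spearman")

# Rank the other variables by |correlation| with glyhb, strongest first
glyhb_cor <- sort(abs(cm["glyhb", colnames(cm) != "glyhb"]), decreasing = TRUE)
print(round(glyhb_cor, 2))
```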

We can now check which correlations are statistically significant.

get_significance <- function(data, method = "kendall") {
  n <- ncol(data)
  sig_matrix <- matrix(NA, n, n)
  rownames(sig_matrix) <- colnames(data)
  colnames(sig_matrix) <- colnames(data)
  
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      test <- cor.test(data[[i]], data[[j]], method = method)
      sig_matrix[i, j] <- test$p.value
      sig_matrix[j, i] <- test$p.value
    }
  }
  return(sig_matrix)
}


get_significance(diabetes, method = "spearman")
## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties

## Warning in cor.test.default(data[[i]], data[[j]], method = method): Cannot
## compute exact p-value with ties
##                      chol     stab.glu          hdl        ratio        glyhb
## chol                   NA 4.517672e-03 7.610405e-03 7.101661e-16 6.041701e-06
## stab.glu     4.517672e-03           NA 1.126925e-04 4.618470e-07 2.413169e-28
## hdl          7.610405e-03 1.126925e-04           NA 1.389484e-92 8.126834e-05
## ratio        7.101661e-16 4.618470e-07 1.389484e-92           NA 1.130869e-09
## glyhb        6.041701e-06 2.413169e-28 8.126834e-05 1.130869e-09           NA
## age          1.073414e-08 3.373630e-11 3.414139e-01 1.294828e-04 2.991115e-18
## height       2.556725e-02 9.958378e-01 1.092279e-02 2.182212e-01 6.745027e-01
## weight       2.904516e-01 2.541528e-04 1.005076e-10 2.474987e-12 4.508379e-05
## waist        3.957307e-02 1.294796e-06 1.374631e-10 5.872831e-14 1.970517e-10
## hip          5.207256e-02 1.097362e-04 2.767752e-06 2.997598e-09 1.257711e-05
## BMI          9.596863e-02 1.040834e-04 8.292849e-10 1.929830e-12 2.274646e-05
## waist_to_hip 6.436984e-02 6.038970e-04 5.580024e-05 1.199265e-06 7.482895e-07
## bp.s         4.752879e-05 5.244497e-07 8.903504e-01 5.394255e-02 1.331098e-07
## bp.d         2.224756e-04 1.309166e-01 1.752964e-01 5.912074e-01 1.969302e-01
##                       age       height        weight         waist
## chol         1.073414e-08 2.556725e-02  2.904516e-01  3.957307e-02
## stab.glu     3.373630e-11 9.958378e-01  2.541528e-04  1.294796e-06
## hdl          3.414139e-01 1.092279e-02  1.005076e-10  1.374631e-10
## ratio        1.294828e-04 2.182212e-01  2.474987e-12  5.872831e-14
## glyhb        2.991115e-18 6.745027e-01  4.508379e-05  1.970517e-10
## age                    NA 1.023866e-01  6.584764e-01  1.062139e-03
## height       1.023866e-01           NA  7.772381e-08  2.597328e-01
## weight       6.584764e-01 7.772381e-08            NA 7.795145e-104
## waist        1.062139e-03 2.597328e-01 7.795145e-104            NA
## hip          5.053197e-01 1.727417e-02  1.443625e-88  1.140091e-96
## BMI          8.925369e-01 1.866302e-01 1.052505e-240 1.678204e-115
## waist_to_hip 1.411603e-07 9.324900e-08  4.226745e-08  9.510770e-28
## bp.s         6.339951e-22 8.921703e-01  7.490632e-03  5.686684e-06
## bp.d         5.243543e-02 2.813468e-01  1.315254e-04  4.140181e-05
##                        hip           BMI waist_to_hip         bp.s         bp.d
## chol          5.207256e-02  9.596863e-02 6.436984e-02 4.752879e-05 2.224756e-04
## stab.glu      1.097362e-04  1.040834e-04 6.038970e-04 5.244497e-07 1.309166e-01
## hdl           2.767752e-06  8.292849e-10 5.580024e-05 8.903504e-01 1.752964e-01
## ratio         2.997598e-09  1.929830e-12 1.199265e-06 5.394255e-02 5.912074e-01
## glyhb         1.257711e-05  2.274646e-05 7.482895e-07 1.331098e-07 1.969302e-01
## age           5.053197e-01  8.925369e-01 1.411603e-07 6.339951e-22 5.243543e-02
## height        1.727417e-02  1.866302e-01 9.324900e-08 8.921703e-01 2.813468e-01
## weight        1.443625e-88 1.052505e-240 4.226745e-08 7.490632e-03 1.315254e-04
## waist         1.140091e-96 1.678204e-115 9.510770e-28 5.686684e-06 4.140181e-05
## hip                     NA 4.759370e-120 7.947668e-01 4.513842e-04 5.181553e-05
## BMI          4.759370e-120            NA 5.355218e-06 2.884800e-03 5.310051e-05
## waist_to_hip  7.947668e-01  5.355218e-06           NA 3.477124e-03 2.018276e-01
## bp.s          4.513842e-04  2.884800e-03 3.477124e-03           NA 5.176381e-40
## bp.d          5.181553e-05  5.310051e-05 2.018276e-01 5.176381e-40           NA
get_significance(diabetes, method = "kendall")
##                      chol     stab.glu          hdl        ratio        glyhb
## chol                   NA 6.382198e-03 6.293696e-03 6.529747e-16 7.475404e-06
## stab.glu     6.382198e-03           NA 1.173264e-04 5.035866e-07 7.217143e-28
## hdl          6.293696e-03 1.173264e-04           NA 2.031745e-73 9.132880e-05
## ratio        6.529747e-16 5.035866e-07 2.031745e-73           NA 2.143410e-09
## glyhb        7.475404e-06 7.217143e-28 9.132880e-05 2.143410e-09           NA
## age          1.413286e-08 1.174459e-10 3.112328e-01 1.007381e-04 4.718043e-17
## height       2.455952e-02 9.803541e-01 1.113324e-02 2.146913e-01 6.737615e-01
## weight       2.732637e-01 2.795915e-04 1.505147e-10 1.204130e-11 5.808091e-05
## waist        3.500692e-02 1.541636e-06 2.405686e-10 3.338327e-13 3.113901e-10
## hip          5.330542e-02 1.066166e-04 3.209184e-06 7.294977e-09 1.606475e-05
## BMI          8.616310e-02 1.010760e-04 8.880549e-10 6.356182e-12 3.550117e-05
## waist_to_hip 5.417245e-02 6.040125e-04 4.411779e-05 9.070827e-07 9.448597e-07
## bp.s         4.146121e-05 1.270779e-06 8.692446e-01 5.273371e-02 2.179977e-07
## bp.d         2.478486e-04 1.445828e-01 1.725065e-01 5.967590e-01 1.886785e-01
##                       age       height        weight        waist          hip
## chol         1.413286e-08 2.455952e-02  2.732637e-01 3.500692e-02 5.330542e-02
## stab.glu     1.174459e-10 9.803541e-01  2.795915e-04 1.541636e-06 1.066166e-04
## hdl          3.112328e-01 1.113324e-02  1.505147e-10 2.405686e-10 3.209184e-06
## ratio        1.007381e-04 2.146913e-01  1.204130e-11 3.338327e-13 7.294977e-09
## glyhb        4.718043e-17 6.737615e-01  5.808091e-05 3.113901e-10 1.606475e-05
## age                    NA 9.871842e-02  6.379594e-01 9.533447e-04 5.141915e-01
## height       9.871842e-02           NA  1.519901e-07 2.816247e-01 1.382471e-02
## weight       6.379594e-01 1.519901e-07            NA 4.634765e-80 1.497740e-71
## waist        9.533447e-04 2.816247e-01  4.634765e-80           NA 2.137535e-76
## hip          5.141915e-01 1.382471e-02  1.497740e-71 2.137535e-76           NA
## BMI          8.693605e-01 2.191128e-01 1.338262e-137 5.334134e-86 4.183931e-90
## waist_to_hip 1.787569e-07 1.339103e-07  4.794731e-08 2.536168e-28 5.386411e-01
## bp.s         6.533636e-21 9.000141e-01  7.007826e-03 6.061786e-06 3.493373e-04
## bp.d         5.143506e-02 2.990254e-01  1.071346e-04 4.023362e-05 4.365406e-05
##                        BMI waist_to_hip         bp.s         bp.d
## chol          8.616310e-02 5.417245e-02 4.146121e-05 2.478486e-04
## stab.glu      1.010760e-04 6.040125e-04 1.270779e-06 1.445828e-01
## hdl           8.880549e-10 4.411779e-05 8.692446e-01 1.725065e-01
## ratio         6.356182e-12 9.070827e-07 5.273371e-02 5.967590e-01
## glyhb         3.550117e-05 9.448597e-07 2.179977e-07 1.886785e-01
## age           8.693605e-01 1.787569e-07 6.533636e-21 5.143506e-02
## height        2.191128e-01 1.339103e-07 9.000141e-01 2.990254e-01
## weight       1.338262e-137 4.794731e-08 7.007826e-03 1.071346e-04
## waist         5.334134e-86 2.536168e-28 6.061786e-06 4.023362e-05
## hip           4.183931e-90 5.386411e-01 3.493373e-04 4.365406e-05
## BMI                     NA 3.983044e-06 2.657579e-03 4.201644e-05
## waist_to_hip  3.983044e-06           NA 4.303401e-03 2.271859e-01
## bp.s          2.657579e-03 4.303401e-03           NA 1.413531e-36
## bp.d          4.201644e-05 2.271859e-01 1.413531e-36           NA

When checking correlations with Spearman's method, we are warned about ties. This means that multiple samples share the same values for some variables, which can make Spearman's test slightly biased. When data has a significant number of ties, Kendall's Tau is generally more reliable than Spearman's rank correlation because:

  1. Direct Pair Comparison: Kendall's Tau compares all possible pairs of data points and counts how many pairs are in the same order.
  2. Less Sensitive to Ties: unlike Spearman, it does not rely on assigning ranks directly, but on the ordering of pairs, which makes it more robust when ties are present.
  3. More Conservative: it often produces lower correlation coefficients (as we noticed when comparing the heatmaps for Spearman and Kendall), but these are usually more accurate and reliable when ties are present.

Since our data has many ties, we will rely on Kendall's Tau.
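As a small illustration (toy vectors, not the study data): with ties present, `cor.test()` cannot compute an exact p-value; passing `exact = FALSE` requests the normal approximation explicitly and avoids the warning, and the Kendall estimate returned is tau-b, whose denominator is corrected for tied pairs.

```r
# Toy data with many ties by construction (not the diabetes data)
set.seed(42)
x <- sample(1:5, 40, replace = TRUE)
y <- x + sample(0:2, 40, replace = TRUE)

# exact = FALSE uses the normal approximation, avoiding the
# "Cannot compute exact p-value with ties" warning
sp <- cor.test(x, y, method = "spearman", exact = FALSE)
ke <- cor.test(x, y, method = "kendall",  exact = FALSE)

# Kendall's tau-b is ties-corrected; it is typically smaller in
# magnitude than Spearman's rho on the same data
c(spearman = unname(sp$estimate), kendall = unname(ke$estimate))
```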


Key findings

The analysis reveals several biologically relevant associations between the variables studied. Cholesterol levels show significant correlations with glycemic control (glyhb) and body weight, which aligns with established relationships between metabolic health and lipid profiles. Glycated hemoglobin (glyhb) is notably correlated with variables such as waist circumference, hip circumference, BMI, and waist-to-hip ratio, all of which are known indicators of metabolic syndrome and insulin resistance. The robust correlation between glyhb and these anthropometric measures reinforces their importance as risk factors for metabolic disorders.

Blood pressure (both systolic and diastolic) exhibits significant associations with weight, waist circumference, and BMI. This relationship supports the well-documented link between obesity and hypertension. Additionally, the association of blood pressure with glycated hemoglobin suggests potential interactions between cardiovascular health and blood glucose regulation.

Interestingly, cholesterol levels also show correlations with various anthropometric measures, albeit to a lesser degree. The observed associations could indicate underlying metabolic processes linking lipid metabolism, body composition, and cardiovascular risk.

Principal Component Analysis

Principal Component Analysis (PCA) is a powerful dimensionality reduction technique commonly used in data analysis and machine learning. Its primary purpose is to reduce the number of features (variables) in a dataset while preserving as much of the original information as possible.

PCA achieves dimensionality reduction by transforming the original variables into a new set of uncorrelated variables, called Principal Components (PCs). These components are linear combinations of the original variables and are ranked in order of the amount of variance they explain in the data:

  1. PC1 (First Principal Component): The direction in which the data varies the most.
  2. PC2 (Second Principal Component): The direction of the second-largest variance, orthogonal to PC1.
  3. And so on…

The idea is that a small number of these principal components can capture most of the variability present in the dataset.

In this study, PCA will be used to reduce the dimensionality of the dataset, making visualization of classification boundaries feasible and enhancing our understanding of how the different classes (Diabetic, Non-Diabetic, Prediabetic) are distributed in the feature space.

#install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Performing PCA
pca_result <- prcomp(diabetes, scale = TRUE)

# Scree Plot
fviz_eig(pca_result, addlabels = TRUE, ylim = c(0, 50))

# Inspecting Loadings
pca_result$rotation
##                      PC1         PC2         PC3          PC4         PC5
## chol          0.12106437  0.32852784  0.09540116 -0.261180432  0.26120018
## stab.glu      0.20339463  0.35474272 -0.21294936 -0.093844621 -0.43740303
## hdl          -0.20747129  0.05412221  0.44894797  0.005316485 -0.42698331
## ratio         0.25792699  0.20769091 -0.35620251 -0.139600037  0.53159280
## glyhb         0.21178246  0.39656574 -0.19119272 -0.134487043 -0.36680966
## age           0.10231727  0.43875964  0.12847883 -0.010736354 -0.07787485
## height        0.06017702 -0.01502318 -0.19320881  0.680602238 -0.03807773
## weight        0.42144152 -0.26130439  0.03669510  0.074855796 -0.07904174
## waist         0.43972728 -0.12740608  0.07748852  0.059980001 -0.08837684
## hip           0.38264738 -0.26267308  0.15638117 -0.273762712 -0.08644486
## BMI           0.42168038 -0.26818929  0.08148923 -0.067661076 -0.07408503
## waist_to_hip  0.20680716  0.17428340 -0.10248219  0.529440313 -0.02414247
## bp.s          0.14444393  0.31440355  0.46903073  0.114803693  0.18212406
## bp.d          0.12892566  0.12451102  0.50772687  0.199569853  0.26965213
##                       PC6         PC7          PC8          PC9         PC10
## chol          0.705321921  0.12688528 -0.017032969 -0.056273090  0.056140413
## stab.glu     -0.056658388 -0.31109247  0.137363186 -0.049290548  0.679964826
## hdl           0.459112314  0.13362516  0.006951515 -0.060631175  0.010991480
## ratio         0.095505846 -0.03331440 -0.022922154 -0.020791684  0.039042914
## glyhb        -0.001285710 -0.25503988  0.136037520  0.048012644 -0.720018943
## age          -0.271081058  0.42748164 -0.573313561  0.424678793  0.055398342
## height        0.313971199 -0.32091486 -0.481607058 -0.014277984 -0.043556356
## weight        0.118780957 -0.05278748 -0.123655082  0.011959536  0.034773355
## waist        -0.019996696  0.25303886  0.121538380 -0.043747817 -0.037528189
## hip          -0.031172534 -0.02854156 -0.170078493  0.002869109 -0.050118663
## BMI           0.055359056  0.01226860 -0.036546153  0.021796688  0.040238278
## waist_to_hip  0.004124864  0.50690277  0.471953518 -0.057231493  0.005012728
## bp.s         -0.295429436 -0.14433118 -0.131834949 -0.694267688 -0.036110116
## bp.d         -0.038731416 -0.41451369  0.317304949  0.565341401  0.037146983
##                     PC11          PC12          PC13          PC14
## chol          0.03779151  0.4585338359 -0.0013531162 -1.496033e-04
## stab.glu      0.05820936  0.0163813461 -0.0034398400 -2.236330e-03
## hdl          -0.03446568 -0.5800369281 -0.0035681033  7.909653e-04
## ratio        -0.01692908 -0.6696090838 -0.0012038277 -1.092763e-03
## glyhb        -0.06575122  0.0091542984  0.0005846927 -1.000524e-06
## age          -0.07691537 -0.0007675883  0.0088399172  2.212614e-03
## height        0.19654632  0.0196292506 -0.0347232709 -1.353586e-01
## weight       -0.45867671  0.0190344508  0.1758731269  6.824173e-01
## waist         0.44465544 -0.0266453685  0.6806984297 -1.693929e-01
## hip           0.53513029 -0.0417907359 -0.5754624745  1.587732e-01
## BMI          -0.49298239  0.0152660734 -0.1732957924 -6.735072e-01
## waist_to_hip -0.01990592  0.0210048243 -0.3783662012  9.196216e-02
## bp.s         -0.06982520  0.0150908588 -0.0064336114 -8.540078e-04
## bp.d          0.04548282 -0.0216511145  0.0039231119  6.319832e-04
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Extracting the loadings (arrows)
loadings <- as.data.frame(pca_result$rotation[, 1:3])
pca_scores <- as.data.frame(pca_result$x[, 1:3])

# Scale the loadings for better visualization
loadings$PC1 <- loadings$PC1 * max(abs(pca_scores$PC1))
loadings$PC2 <- loadings$PC2 * max(abs(pca_scores$PC2))
loadings$PC3 <- loadings$PC3 * max(abs(pca_scores$PC3))

# Create the 3D scatter plot with PCA scores
fig <- plot_ly() %>%
  add_trace(
    data = pca_scores,
    x = ~PC1,
    y = ~PC2,
    z = ~PC3,
    type = 'scatter3d',
    mode = 'markers',
    marker = list(size = 3, color = 'orange'),
    name = 'Data Points'
  )

# Adding loadings as arrows, one by one
for (i in 1:nrow(loadings)) {
  fig <- fig %>%
    add_trace(
      x = c(0, loadings$PC1[i]),
      y = c(0, loadings$PC2[i]),
      z = c(0, loadings$PC3[i]),
      type = 'scatter3d',
      mode = 'lines+text',
      line = list(color = 'steelblue', width = 4),
      text = rownames(loadings)[i],
      textposition = 'top center',
      name = paste("Loading", rownames(loadings)[i])
    )
}

# Layout
fig <- fig %>%
  layout(
    title = "3D PCA Plot with Loadings",
    scene = list(
      xaxis = list(title = 'PC1'),
      yaxis = list(title = 'PC2'),
      zaxis = list(title = 'PC3')
    )
  )

fig
1. Scree Plot

The scree plot shows the percentage of variance explained by each of the principal components (PCs). The purpose of this plot is to help determine how many components are worth keeping for further analysis.

  • PC1 (31.1%), PC2 (16.4%) and PC3 (12.6%) explain the most variance, totaling about 60%.
  • The first few components explain the majority of the variance, and the contribution decreases with each subsequent component.
  • By the time we reach PC8–PC10, the variance explained is minimal, suggesting these components contribute very little to the overall information.
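The percentages read off the scree plot can be verified directly from the `prcomp` fit. A self-contained sketch (a stand-in fit is built here so the snippet runs on its own; in the report, use the `pca_result` object computed above):

```r
# Stand-in for the report's pca_result so the snippet is runnable;
# in the analysis, reuse the prcomp fit from the chunk above instead.
set.seed(1)
pca_result <- prcomp(matrix(rnorm(200), ncol = 4), scale. = TRUE)

# Variance explained per PC -- the quantities fviz_eig() plots
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
round(100 * var_explained, 1)          # per-component percentages
round(100 * cumsum(var_explained), 1)  # cumulative share
```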

2. PCA 3D plot

This is a biplot, which shows the projection of the original variables (arrows) and data points (orange dots) in the space defined by the first three principal components. The length and direction of each arrow indicate how strongly each original variable contributes to the PCs and their correlations.

  • Variables pointing in the same direction are positively correlated.
  • Variables pointing in opposite directions are negatively correlated.
  • Longer arrows indicate stronger contributions to the PCs.

3. Loadings Table

This table shows the contribution of each original variable to the principal components (loadings). High loadings (absolute values) indicate strong relationships between the variable and the principal component.

  • For PC1 (31.1%):
    • Weight (0.421), Waist (0.440), Hip (0.383), BMI (0.422) are heavily loaded.
    • This suggests PC1 mostly captures body size or obesity-related factors.
  • For PC2 (16.4%):
    • Age (0.439), Glyhb (0.397), Stab.glu (0.355) contribute heavily.
    • This suggests PC2 captures aspects related to age and glucose metabolism.
  • For PC3 (12.6%):
    • Bp.s (0.469), Bp.d (0.508), Hdl (0.449) are the most heavily loaded variables.
    • This suggests PC3 primarily captures blood pressure and cholesterol-related factors.
  • PC4 onwards:
    • Contribute less to the overall variance.
    • Each component captures smaller, more specific patterns not captured by the previous principal components.
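The top contributors listed above can also be pulled out programmatically by ranking absolute loadings per component. A sketch (a stand-in fit with generic variable names `v1`–`v6` keeps it runnable; in the report, apply this to `pca_result$rotation`):

```r
# Stand-in prcomp fit with named variables; in the report use pca_result.
set.seed(1)
m <- matrix(rnorm(300), ncol = 6)
colnames(m) <- paste0("v", 1:6)
pca_fit <- prcomp(m, scale. = TRUE)

# For each of the first three PCs, list variables by absolute loading
top_loadings <- apply(abs(pca_fit$rotation[, 1:3]), 2,
                      function(l) names(sort(l, decreasing = TRUE))[1:3])
top_loadings
```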

K-Means Clustering

In this section, we categorize the dataset into three classes, Non-Diabetic, Prediabetic, and Diabetic, based on the key variable stab.glu (stable glucose). After categorization, we apply PCA to reduce the dimensionality of the data, making visualization and further analysis more efficient.

NOTE
Having an actual diagnosis of diabetes would be ideal, but since this is missing, we will create the labels based solely on stab.glu so that we can proceed with a simple machine learning analysis.

Additionally, we employ K-means clustering to identify natural groupings within the data and compare these clusters with the labeled categories. This helps us gain insights into how well the clustering algorithm identifies patterns that correspond to the diabetes status categories.

# Categorize stab.glu into Non-Diabetic, Prediabetic, and Diabetic

diabetes$DiabetesStatus <- cut(diabetes$stab.glu,
                           breaks = c(-Inf, 99, 125, Inf),
                           labels = c("Non-Diabetic", "Prediabetic", "Diabetic"))

# Extracting the relevant columns (standardizing them)
data_scaled <- scale(diabetes[, c('chol', 'stab.glu', 'hdl', 'ratio', 'glyhb', 'age', 'height', 'weight', 'waist', 'hip', 'BMI', 'waist_to_hip', 'bp.s', 'bp.d')])

# PCA
pca_res <- prcomp(data_scaled, scale. = TRUE)
pca_data <- as.data.frame(pca_res$x)
pca_data$DiabetesStatus <- diabetes$DiabetesStatus

# K-means clustering with 3 clusters
set.seed(123)
kmeans_res <- kmeans(pca_data[, c('PC1', 'PC2', 'PC3')], centers = 3)
pca_data$Cluster <- as.factor(kmeans_res$cluster)

# Interactive 3D Plot with Plotly (Diabetes Status)
plot1 <- plot_ly(pca_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~DiabetesStatus, colors = c('red', 'green', 'blue')) %>%
  add_markers() %>%
  layout(title = "3D PCA Plot with Diabetes Status",
         scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))

# Interactive 3D Plot with Plotly (K-means Clustering)
plot2 <- plot_ly(pca_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Cluster, colors = c('red', 'green', 'blue')) %>%
  add_markers() %>%
  layout(title = "3D K-means Clustering on PCA-Reduced Data",
         scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))

plot1
plot2

PCA Scatter Plot with Diabetes Status
- The plot shows data points reduced to three principal components (PC1, PC2 and PC3) and colored by diabetes status:
- It seems like there’s some overlap between groups, especially between Non-Diabetic and Prediabetic points. However, the Diabetic group (blue) appears more separated and tends to occupy one side of the plot, which suggests some level of distinguishability.

K-means Clustering on PCA-Reduced Data
- This scatter plot shows clusters formed using K-means clustering on the reduced data.
- The clustering seems to capture some meaningful structure:
- Blue Cluster (Cluster 3) appears to align somewhat with the Diabetic points from the previous image.
- Green Cluster (Cluster 2) seems to represent the middle ground, likely containing mostly Prediabetic points.
- Red Cluster (Cluster 1) seems to represent Non-Diabetic points, although there is still some overlap.

Key findings

  • PCA is successful in reducing dimensions while preserving some structure in the data.
  • K-means clustering shows a reasonable separation between groups, especially with the Diabetic group (blue cluster) being more distinct.
  • Overlap exists between Non-Diabetic and Prediabetic points, which might indicate that stable glucose alone is not sufficient for a perfect classification but provides valuable information.
  • The clustering seems to identify broader patterns but could benefit from more advanced classification models to improve the accuracy of distinguishing between these categories.
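The visual comparison of clusters against labels described above can be quantified with a contingency table. A runnable sketch on synthetic stand-ins (in the report, both columns already live in `pca_data`: `Cluster` from `kmeans` and `DiabetesStatus` from `cut()`):

```r
# Stand-ins so the snippet runs on its own; replace with
# pca_data$Cluster and pca_data$DiabetesStatus in the report.
set.seed(1)
status  <- factor(sample(c("Non-Diabetic", "Prediabetic", "Diabetic"),
                         100, replace = TRUE))
cluster <- factor(sample(1:3, 100, replace = TRUE))

# Rows: K-means clusters; columns: glucose-derived labels. After matching
# cluster IDs to labels, a strong diagonal means the clusters recover
# the categories well.
table(Cluster = cluster, Status = status)
```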

Machine Learning algorithms

Lastly, we will train three models on the data and evaluate how well they perform when classifying new data as Diabetic, Prediabetic, or Non-Diabetic.

k-Nearest Neighbors (kNN):
kNN is a simple, instance-based learning algorithm used for classification and regression. It works by finding the ‘k’ most similar data points (neighbors) to a given input and making predictions based on their labels.

Support Vector Machine (SVM):
SVM is a powerful supervised learning algorithm used for classification and regression. It works by finding the optimal hyperplane that best separates data points of different classes. By using kernels, it can model complex, non-linear boundaries. SVMs are robust to high-dimensional data but can be sensitive to noise.

Decision Trees:
Decision Trees are a flowchart-like model used for classification and regression. They split data into subsets based on the most informative features, making decisions at each node. They are easy to visualize and interpret but prone to overfitting if not properly pruned.

library(ggplot2)
library(dplyr)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
library(factoextra)
library(cluster)
library(plotly)
library(caret)
## Loading required package: lattice
# Split data into training and testing
set.seed(123)
train_index <- createDataPartition(pca_data$DiabetesStatus, p = 0.7, list = FALSE)
train_data <- pca_data[train_index, ]
test_data <- pca_data[-train_index, ]

# Define the models to train
models <- list(
  SVM = train(DiabetesStatus ~ PC1 + PC2 + PC3, data = train_data, method = 'svmRadial'),
  KNN = train(DiabetesStatus ~ PC1 + PC2 + PC3, data = train_data, method = 'knn', tuneLength = 5),
  DecisionTree = train(DiabetesStatus ~ PC1 + PC2 + PC3, data = train_data, method = 'rpart')
)

# Predict on test data & calculate performance metrics
results <- lapply(models, function(model) {
  predictions <- predict(model, test_data)
  cm <- confusionMatrix(predictions, test_data$DiabetesStatus)
  data.frame(Accuracy = cm$overall['Accuracy'])
})

# Combine results into a single data frame for comparison
results_df <- do.call(rbind, results)
results_df <- cbind(Model = names(models), results_df)
print(results_df)
##                     Model  Accuracy
## SVM                   SVM 0.7321429
## KNN                   KNN 0.7678571
## DecisionTree DecisionTree 0.7232143
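Overall accuracy can hide class-level behaviour, e.g. for the overlapping Prediabetic class. In the report these metrics are available from caret's `confusionMatrix()` (its `$byClass` matrix); the sketch below computes per-class sensitivity from a plain contingency table using toy labels, so the numbers here are illustrative only:

```r
# Toy reference labels and predictions (mostly correct, sometimes random);
# in the report, use predict(models$KNN, test_data) against
# test_data$DiabetesStatus instead.
set.seed(1)
lv   <- c("Non-Diabetic", "Prediabetic", "Diabetic")
ref  <- factor(sample(lv, 60, replace = TRUE), levels = lv)
pred <- factor(ifelse(runif(60) < 0.7, as.character(ref),
                      sample(lv, 60, replace = TRUE)), levels = lv)

tab <- table(Predicted = pred, Reference = ref)
# Sensitivity (recall) per class: correct predictions / class size
sensitivity <- diag(tab) / colSums(tab)
round(sensitivity, 2)
```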
# Plotly 3D Scatter Plot with Decision Boundaries (KNN Example)
grid <- expand.grid(PC1 = seq(min(pca_data$PC1), max(pca_data$PC1), length.out = 50),
                    PC2 = seq(min(pca_data$PC2), max(pca_data$PC2), length.out = 50),
                    PC3 = seq(min(pca_data$PC3), max(pca_data$PC3), length.out = 50))

predictions <- predict(models$KNN, grid)
grid$Prediction <- predictions

plot_ly() %>%
  add_markers(data = pca_data, x = ~PC1, y = ~PC2, z = ~PC3, color = ~DiabetesStatus, colors = c('red', 'green', 'blue'),
              marker = list(size = 3)) %>%
  add_markers(data = grid, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Prediction, opacity = 0.2, marker = list(size = 2)) %>%
  layout(title = '3D PCA Plot with Decision Boundaries',
         scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))